1 Introduction

Generalized linear models (GLMs), as the name implies, are a generalization of the linear modeling framework that allows the modeling of response variables (e.g. soil attributes) with non-normal distributions and heterogeneous variances. Whereas linear models are designed for predicting continuous soil properties such as clay content or soil temperature, GLMs can be used to predict the presence/absence of argillic horizons (i.e. logistic regression) or counts of a plant species along a transect (i.e. Poisson regression). These generalizations greatly expand the applicability of the linear modeling framework, while still allowing for a similar fitting procedure and interpretation of the resulting models.

In the past, non-linearity and heterogeneous variances were handled by transforming the response variable, for example with log(x). However, such transformations complicate the model's interpretation because the results refer to the transformed scale (e.g. log(x)), and they are not guaranteed to achieve both normality and constant variance simultaneously. GLMs instead model a transformation of the mean response while preserving the original scale of the response, and they provide separate functions to transform the mean response and the variance, known as the link and variance functions respectively. So instead of looking like this:

\(f(y) = \beta_{0} + \beta_{1}x + \varepsilon\)

you get this:

\(g(\mu) = \eta = \beta_{0} + \beta_{1}x\)

with \(g(\mu)\) symbolizing the link function applied to the mean response \(\mu\), and \(\eta\) the resulting linear predictor. Unlike the transformed-response model above, the random variation in a GLM is captured by the assumed distribution of the response rather than by an additive error term.

Another alteration of the classical linear model is that with GLMs the coefficients are estimated iteratively by maximum likelihood estimation instead of by ordinary least squares. As a result, a GLM minimizes the deviance instead of the sum of squares. However, for the Gaussian (i.e. normal) distribution the deviance and the sum of squares are equivalent.
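
This equivalence is easy to verify in R; a minimal sketch using the built-in mtcars data (any continuous response would do):

```r
# For a Gaussian GLM, the deviance equals the residual sum of squares from lm()
fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(mpg ~ wt, data = mtcars, family = gaussian)

all.equal(deviance(fit_glm), sum(residuals(fit_lm)^2)) # TRUE
```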

2 Logistic Regression

Logistic regression is a specific type of GLM designed to model data that has a binomial distribution (i.e. presence/absence, yes/no, or proportional data), which in statistical learning parlance is considered a classification problem. For binomial data the logit link transform is generally used. The effect of the logit transform can be seen in the following figure. It creates a sigmoidal curve, which enhances the separation between the two groups. It also has the effect of ensuring that the values range between 0 and 1.

When comparing a simple linear model versus a simple logistic model, we can see the effect of the logit transform on the relationship between the response and the predictor variable. As before, it follows a sigmoidal curve and prevents predictions from falling below 0 or exceeding 1.
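
The logit link and its inverse are available in base R as qlogis() and plogis(); a minimal sketch of the transform described above:

```r
# logit maps probabilities in (0, 1) to the whole real line
p   <- seq(0.01, 0.99, by = 0.01)
eta <- qlogis(p)         # logit: log(p / (1 - p))

# the inverse logit maps any real value back into (0, 1),
# tracing the sigmoidal curve seen in the figure
p_back <- plogis(eta)
range(p_back)            # stays within 0 and 1
all.equal(p, p_back)     # TRUE: the two functions are inverses
```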

3 Examples

Example 1: Probability of Mollisols (Beaudette & O’Geen, 2009)

Example 2: Probability of Red Clay (Evans & Hartemink, 2014)

Example 3: Probability of Ponding (NRCS Unpublished)

4 Exercise

Now that we’ve discussed some of the basic background GLM theory, we’ll move on to a real exercise and address any additional theory where it relates to specific steps in the modeling process. The examples selected for this chapter come from Joshua Tree National Park (JTNP) (i.e. CA794) in the Mojave Desert. The problem tackled here is a familiar one: Where can I expect to find argillic horizons on fan piedmonts? Argillic horizons within the Mojave are typically found on fan remnants, which are stable landforms left over from the Pleistocene (Peterson, 1981). Despite the low relief of most fans, fan remnants are uplands in the sense that they generally don’t receive run-on or active deposition.

With this dataset we’ll encounter some challenges. To start with, fan piedmont landscapes typically have relatively little relief. Since most of our predictors will be derivatives of elevation, that won’t leave us with much to work with. Also, our elevation data come from the USGS National Elevation Dataset (NED), which provides considerably less detail than, say, LiDAR or IfSAR data (Shi et al., 2012). Lastly, our pedon dataset, like most in NASIS, hasn’t received nearly as much quality control as the components have, so we’ll need to wrangle some of the pedon data before we can analyze it. These are all typical problems encountered in any data analysis and should be good practice. Ideally, it would be more interesting to try to model individual soil series with argillic horizons, but due to some of the challenges previously mentioned that would be difficult with this dataset. However, at the end we’ll look at one simple approach to try to separate individual soil series with argillic horizons.

4.1 Load packages

To start, as always, we need to load some extra packages. This will become a familiar routine every time you start R. Most of the basic functions we need to develop a logistic regression model are contained in base R, but the following packages contain some useful spatial and data manipulation functions. Believe it or not, we will use all of them and more.

library(aqp)     # specialized soil classes and functions
library(soilDB)  # NASIS and SDA import functions
library(raster)  # raster data functions
library(sf)      # vector data functions
library(mapview) # interactive mapping
library(ggplot2) # graphing
library(tidyr)   # data manipulation
library(caret)   # classification and regression training
library(car)     # additional regression tools

4.2 Read in data

Hopefully, like all good soil scientists and ecological site specialists, you enter your field data into NASIS. Better yet, hopefully someone else did it for you! Once data are captured in NASIS it is much easier to import the data into R, extract the pieces you need, manipulate them, model them, etc. If the data aren’t entered into NASIS, they may as well not exist.

# pedons <- fetchNASIS()
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ch7_data.Rdata"
load(url(githubURL))

# Examine the makeup of the data we imported from NASIS
str(pedons, max.level = 2)
## Formal class 'SoilProfileCollection' [package "aqp"] with 11 slots
##   ..@ idcol       : chr "peiid"
##   ..@ hzidcol     : chr "phiid"
##   ..@ hzdesgncol  : chr "hzname"
##   ..@ hztexclcol  : chr "texture"
##   ..@ depthcols   : chr [1:2] "hzdept" "hzdepb"
##   ..@ metadata    :'data.frame': 1 obs. of  2 variables:
##   ..@ horizons    :'data.frame': 5163 obs. of  54 variables:
##   ..@ site        :'data.frame': 1220 obs. of  122 variables:
##   ..@ sp          :Formal class 'SpatialPoints' [package "sp"] with 3 slots
##   ..@ diagnostic  :'data.frame': 2161 obs. of  4 variables:
##   ..@ restrictions:'data.frame': 0 obs. of  0 variables

5 Exploratory analysis

5.1 Data wrangling

Generally, before you begin modeling, you should spend some time exploring the data. By examining a simple summary we can quickly see the breakdown of how many argillic horizons we have. Unfortunately, odds are good that not all of the argillic horizons have been consistently populated in the diagnostic horizon table like they should be. Luckily for us, the desert argillic horizons always pop up in the taxonomic name, so we can use pattern matching to extract them. By doing this we gain additional pedons with argillic horizons and are able to label the missing values (i.e. NA). At a minimum for modeling purposes, we probably need 10 pedons of the target class we’re interested in and a total of 100 observations overall.

# Check consistency of argillic horizon population

# get the site table
s <- site(pedons) 

# tabulate the number of argillic horizons observed
table(s$argillic.horizon, useNA = "ifany") 
## 
## FALSE  TRUE  <NA> 
##   790   263   167
# or

# summary(s$argillic.horizon) 

# Extract argillic presence from the taxonomic subgroup
s$argillic <- grepl("arg", s$taxsubgrp)

table(s$argillic, useNA = "ifany")
## 
## FALSE  TRUE 
##  1022   198

Ideally, if the diagnostic horizon table had been populated consistently, we could have used the upper depth to the diagnostic feature to filter out argillic horizons that start below 50 cm, which may not be representative of “good” argillic horizons and may therefore have been correlated to Torripsamments anyway. Not only are unrepresentative sites confusing for scientists, they’re equally confusing for models. However, as we saw earlier, some pedons don’t appear to be fully populated, so we’ll stick with the pedons that have the argillic specified in their taxonomic subgroup name, since that gives us the biggest sample.

d <- diagnostic_hz(pedons)
d_sub <- subset(d, featkind == "argillic horizon" & featdept < 50)

s$argillic.horizon50 <- ifelse(s$peiid %in% d_sub$peiid, TRUE, FALSE)

table(s$argillic.horizon50, useNA = "ifany")
## 
## FALSE  TRUE 
##   998   222

5.2 Geomorphic data

Another obvious place to look is at the geomorphic data in the site table. This information is intended to help differentiate where our soil observations exist on the landscape. If populated consistently it could potentially be used in future disaggregation efforts, as demonstrated by Nauman and Thompson (2014).

# Landform vs argillic presence

# Subset
s_sub <- subset(s, argillic == TRUE)

# Cross tabulate landform vs argillic horizon presence
test <- with(s_sub, 
             table(landform, argillic, useNA = "ifany")
             )
# Subset and print landforms with > 3 observations
test[test > 3,]
##              alluvial fans                 fan aprons 
##                          6                         19 
## fan aprons on fan remnants               fan remnants 
##                          4                         72 
##                      hills                 hillslopes 
##                         15                         28 
##                  low hills            mountain slopes 
##                          5                          8 
##                  mountains                  pediments 
##                          4                          9 
##                       <NA> 
##                          7
# generalize the landform
s$landform_generic <- ifelse(grepl("fan|terrace|sheet|drainageway|wash", s$landform), "fan", "hill") 

Examining the above frequency table, we can see that argillic horizons occur predominantly on fan remnants, as was alluded to earlier. However, they also seem to occur frequently on other landforms, some of which are curious combinations of landforms or redundant terms.

# Hillslope position

# Subset fan landforms
s_sub <- subset(s, landform_generic == "fan") 

# Cross tabulate and calculate proportions, the "2" calculates the proportions relative to the column totals
with(s_sub, round(
  prop.table(table(hillslopeprof, argillic, useNA = "ifany"), 2)
  * 100)
  ) 
##              argillic
## hillslopeprof FALSE TRUE
##     summit       16   39
##     shoulder      4    2
##     backslope    14   20
##     footslope     2    1
##     toeslope     16    4
##     <NA>         48   34
# Slope shape
with(s_sub, round(
  prop.table(table(paste(shapedown, shapeacross), argillic, useNA = "ifany"), 2)
  * 100)
  )
##                  argillic
##                   FALSE TRUE
##   concave concave     1    0
##   concave convex      0    0
##   concave linear      4    3
##   convex concave      0    1
##   convex convex       7    7
##   convex linear       6    9
##   linear concave      6    1
##   linear convex      21   32
##   linear linear      44   38
##   linear NA           0    0
##   NA NA              11   10

Looking at the hillslope position of fan landforms, we can see that a slightly higher proportion of argillic horizons is found on summits, while fewer are found on toeslopes. Slope shape doesn’t seem to provide any useful information for distinguishing argillic horizons.

# Subset to just look at fans, and select numeric columns
s_sub <- subset(s, landform_generic == "fan", select = c(argillic, bedrckdepth, slope, elev, surface_total_frags_pct)) 

# convert s_sub to long data format
s_w <- tidyr::gather(s_sub, key = key, value = value, - argillic) 
head(s_w, 2)
##   argillic         key value
## 1    FALSE bedrckdepth    NA
## 2    FALSE bedrckdepth    11
library(ggplot2)

ggplot(s_w, aes(x = argillic, y = value)) +
  geom_boxplot() +
  facet_wrap(~ key, scale = "free")

Looking at our numeric variables, only depth to bedrock seems to show much separation between the presence/absence of argillic horizons.

5.3 Soil Scientist Bias

Next we’ll look at soil scientist bias. The question being: are some soil scientists more likely to describe argillic horizons than others? Due to the large number of soil scientists that have worked on CA794, including detailees, we’ve filtered the names to include just the top 3 mappers and given priority to the most senior soil scientists when they occur together.

# Filter to the top 3 soil scientists, with priority given by seniority
s <- within(s, {
  old = descname
  descname2 = NA
  descname2[grepl("Stephen", old)] = "Stephen" # least senior
  descname2[grepl("Paul",    old)] = "Paul"
  descname2[grepl("Peter",   old)] = "Peter"   # most senior
  })
s_sub <- subset(s, landform_generic == "fan")

# By frequency
with(s_sub, table(descname2, argillic, useNA = "ifany"))
##          argillic
## descname2 FALSE TRUE
##   Paul       77   28
##   Peter     268   29
##   Stephen    66   13
##   <NA>      157   45
# By proportion
with(s_sub, round(
  prop.table(table(descname2, argillic), margin = 1)
  * 100)
  )
##          argillic
## descname2 FALSE TRUE
##   Paul       73   27
##   Peter      90   10
##   Stephen    84   16

For fan landforms, Paul appears somewhat more likely than Peter or Stephen to describe argillic horizons. However, while this information is suggestive, it is far from definitive in showing a potential bias because it doesn’t take into account other factors, such as where each soil scientist worked. We’ll examine this more closely later.

5.4 Plot coordinates

Where do our points plot? We can plot their general location in R, but for this task we will export them to a shapefile so we can view them in a proper GIS and really inspect them. Notice in the figure below the number of points that fall outside the survey boundary. What it doesn’t show is the points that plot in the ocean or in Mexico!

library(sf)

# create an index to filter out pedons with missing coordinates
idx <- complete.cases(s$x_std, s$y_std)

# Convert the site data to a spatial object
s_sub <- s[idx, ]
s_sf <- st_as_sf(s_sub,
                 coords = c("x_std", "y_std"),
                 crs = 4326
                 )

# reproject
s_sf <- st_transform(s_sf, crs = 5070)

# Read in soil survey area boundaries
ssa   <- read_sf(dsn = "D:/geodata/soils/soilsa_a_nrcs.shp", layer = "soilsa_a_nrcs")
ca794 <- subset(ssa, areasymbol == "CA794") # subset out Joshua Tree National Park
ca794 <- st_transform(ca794, crs = 5070)

# Plot
library(mapview)

mapview(ca794, fill = NA) +
  mapview(s_sf)
# Write shapefile of pedons
write_sf(s_sf, dsn = "C:/workspace2/ca794_pedons.shp", delete_layer = TRUE)

5.4.1 Exercise 1: View the data in ArcGIS

  • Examine the shapefile in ArcGIS along with our potential predictor variables (hint: classify the shapefile symbology using the argillic horizon column)
  • Discuss with your group, and report your observations or hypotheses

5.5 Extracting spatial data

Prior to any spatial analysis or modeling, you will need to develop a suite of geodata files that can be intersected with your field data locations. This is, in and of itself, a difficult task and should be facilitated by your Regional GIS Specialist. Typically, these geodata files consist primarily of derivatives of a DEM or satellite imagery. Prior to any prediction it is also necessary to ensure the geodata files share the same projection, extent, and cell size. Once we have the necessary files, we can construct a list in R of the file names and paths, read the geodata into R, and then extract the geodata values where they intersect with the field data.

library(raster)

# set file path
folder <- "D:/geodata/project_data/R8-VIC/ca794/"

# list of file names
files <- c(
  z      = "ned30m_8VIC.tif", # elevation
  slp    = "ned30m_8VIC_slope5.tif", # slope gradient
  asp    = "ned30m_8VIC_aspect5.tif", # slope aspect
  twi    = "ned30m_8VIC_wetness.tif", # topographic wetness index
  twi_sc = "ned30m_8VIC_wetness_sc.tif", # transformed twi
  ch     = "ned30m_8VIC_cheight.tif", # catchment height
  z2str  = "ned30m_8VIC_z2stream.tif", # height above streams
  mrrtf  = "ned30m_8VIC_mrrtf.tif", # multiresolution ridgetop flatness index
  mrvbf  = "ned30m_8VIC_mrvbf.tif", # multiresolution valley bottom flatness index
  solar  = "ned30m_8VIC_solar.tif", # solar radiation
  precip = "prism30m_8VIC_ppt_1981_2010_annual_mm.tif", # annual precipitation
  precipsum = "prism30m_8VIC_ppt_1981_2010_summer_mm.tif", # summer precipitation
  temp   = "prism30m_8VIC_tmean_1981_2010_annual_C.tif", # annual temperature
  mast   = "mast30m_ca794_2013.tif", # mean annual soil temperature
  pc     = "landsat30m_8VIC_pc123456.tif", # principal components of landsat
  tc     = "landsat30m_8VIC_tc123.tif", # tasseled cap components of landsat
  k      = "gamma30m_8VIC_namrad_k.tif", # gamma radiometrics signatures
  th     = "gamma30m_8VIC_namrad_th.tif",
  u      = "gamma30m_8VIC_namrad_u.tif",
  cluster = "cluster152.tif", # unsupervised classification
  geo    = "sgmc30m_8-VIC_geology.tif"
  )

# combine the folder directory and file names
geodata_f <- paste0(folder, files) 
names(geodata_f) <- names(files)

# Create a raster stack
geodata_r <- stack(geodata_f)

# Extract the geodata and add to a data frame
data <- raster::extract(geodata_r, as(s_sf, "Spatial"), sp = TRUE)@data

# Modify some of the geodata variables
data <- within(data, {
  cluster  = factor(cluster)
  geo      = factor(geo)
  twi_sc   = abs(twi - 13.8) # 13.8 = twi median
  })

# save(data, ca794, pedons, file = "C:/workspace2/github/ncss-tech/stats_for_soil_survey/data/ch7_data.Rdata")

5.6 Examine spatial data

With our spatial data in hand, we can now see whether any of the variables will help us separate the presence/absence of argillic horizons. Because we’re dealing with a classification problem, we’ll compare the numeric variables using boxplots. What we’re looking for are variables with the least amount of overlap in their distribution (i.e. the greatest separation in their median values).

# Load data
# githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ch7_data.Rdata"
# load(url(githubURL))
load(file = "C:/workspace2/github/ncss-tech/stats_for_soil_survey/data/ch7_data.Rdata")
train <- data

# Select argillic horizons with "arg" in the subgroup name and on fans.
# Argillic horizons that occur on hills and mountains more than likely form
# by a different process, and therefore would require a different model.
train <- transform(
  train, 
  argillic  = ifelse(grepl("arg", taxsubgrp) & train$mrvbf > 0.15, "yes", "no"),
  ch_log  = log(ch + 1),
  z2str_log = log(z2str + 1),
  ch    = NULL,
  z2str = NULL
  )
idx <- which(names(train) == "z")
train <- train[idx:ncol(train)]

data_m <- tidyr::gather(train[- c(25:26)], key = key, value = value, - argillic)
data_m <- subset(data_m, !is.na(value))

ggplot(data_m, aes(x = argillic, y = value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free")

6 Constructing the model

# stats

fit_glm <- glm(argillic ~ z + slp + twi_sc + ch_log + z2str_log + mrrtf + solar + precip + precipsum + temp + mast + tc_1 + tc_2 + tc_3 + k + th + u + cluster, data = train, family = binomial)


# rms

library(rms)

dd <- datadist(train)
options(datadist = "dd")

fit_lrm <- lrm(argillic ~ z + slp + twi_sc + ch_log + z2str_log + mrrtf + solar + precip + precipsum + temp + mast + tc_1 + tc_2 + tc_3 + k + th + u, data = train, x = TRUE, y = TRUE)

6.1 Diagnostics

6.1.1 Residual plots

par(mfcol = c(2, 2))
plot(fit_glm)

par(mfcol = c(1, 1))

6.1.2 Multicollinearity

sqrt(vif(fit_glm))
##         z       slp    twi_sc    ch_log z2str_log     mrrtf     solar    precip 
## 12.258756  1.395986  1.238374  1.348540  1.456729  1.224915  1.734997  1.882783 
## precipsum      temp      mast      tc_1      tc_2      tc_3         k        th 
##  2.999279 13.491502  7.725608  4.188285  2.696183  3.233941  1.672204  1.815336 
##         u  cluster3  cluster4  cluster5  cluster6  cluster7  cluster8 cluster10 
##  1.922575  1.620130  1.000000  2.950404  1.187222  1.751695  1.520434  1.000001 
## cluster11 cluster12 cluster13 cluster14 cluster15 
##  1.000001  1.256096  1.070878  2.100469  2.063241

6.2 Variable selection & model validation

set.seed(42)

# rms
## stepwise selection and validation
step_rms <- validate(fit_lrm, method = "crossvalidation", B = 10, bw = TRUE)
## 
##      Backwards Step-down - Original Model
## 
##  Deleted   Chi-Sq d.f. P      Residual d.f. P      AIC   
##  z2str_log 0.00   1    0.9466  0.00     1   0.9466  -2.00
##  z         0.09   1    0.7681  0.09     2   0.9553  -3.91
##  mrrtf     0.12   1    0.7340  0.21     3   0.9765  -5.79
##  solar     0.15   1    0.6996  0.36     4   0.9859  -7.64
##  tc_3      0.32   1    0.5733  0.67     5   0.9844  -9.33
##  precip    0.22   1    0.6368  0.90     6   0.9893 -11.10
##  tc_1      0.51   1    0.4739  1.41     7   0.9853 -12.59
##  k         0.37   1    0.5407  1.78     8   0.9870 -14.22
##  temp      1.80   1    0.1794  3.59     9   0.9365 -14.41
##  u         2.72   1    0.0990  6.31    10   0.7888 -13.69
##  th        0.60   1    0.4394  6.91    11   0.8067 -15.09
##  mast      1.47   1    0.2248  8.38    12   0.7549 -15.62
##  ch_log    5.51   1    0.0189 13.89    13   0.3819 -12.11
##  precipsum 6.30   1    0.0121 20.19    14   0.1243  -7.81
## 
## Approximate Estimates after Deleting Factors
## 
##               Coef    S.E. Wald Z         P
## Intercept -3.64755 0.67286 -5.421 5.928e-08
## slp       -0.20025 0.03892 -5.146 2.666e-07
## twi_sc    -0.40350 0.08604 -4.690 2.738e-06
## tc_2       0.07031 0.01249  5.631 1.788e-08
## 
## Factors in Final Model
## 
## [1] slp    twi_sc tc_2
# test accuracy and error
step_rms
##           index.orig training    test optimism index.corrected  n
## Dxy           0.6942   0.6870  0.6377   0.0492          0.6449 10
## R2            0.3059   0.3024  0.2833   0.0192          0.2867 10
## Intercept     0.0000   0.0000 -0.0752   0.0752         -0.0752 10
## Slope         1.0000   1.0000  0.9932   0.0068          0.9932 10
## Emax          0.0000   0.0000  0.0190   0.0190          0.0190 10
## D             0.1597   0.1578  0.1457   0.0121          0.1476 10
## U            -0.0017  -0.0019  0.0069  -0.0088          0.0071 10
## Q             0.1615   0.1597  0.1388   0.0209          0.1406 10
## B             0.0766   0.0772  0.0796  -0.0024          0.0791 10
## g             3.7577   3.6113  3.5670   0.0443          3.7133 10
## gp            0.1275   0.1269  0.1217   0.0052          0.1223 10
## 
## Factors Retained in Backwards Elimination
## 
##  z slp twi_sc ch_log z2str_log mrrtf solar precip precipsum temp mast tc_1 tc_2
##    *   *                                          *                            
##    *   *                                                                   *   
##    *   *                                          *                            
##    *   *                                                              *        
##    *   *                                          *                            
##    *   *                                                              *        
##    *   *                                                                   *   
##    *   *                                                                   *   
##    *   *                                                                   *   
##    *   *                                                                   *   
##  tc_3 k th u
##             
##             
##             
##             
##             
##             
##             
##             
##             
##             
## 
## Frequencies of Numbers of Factors Retained
## 
##  3 
## 10
# caret
library(caret)

## cross validation parameters
train.control <- trainControl(method = "cv", 
                              number = 10, 
                              classProbs = TRUE,
                              savePredictions = TRUE,
                              summaryFunction = twoClassSummary
                              )

# stepwise selection and validation
step_caret <- train(argillic ~ z + slp + twi_sc + ch_log + z2str_log + mrrtf + solar + precip + precipsum + temp + mast + tc_1 + tc_2 + tc_3 + k + th + u,
                     data = train,
                     method="glmStepAIC", 
                     family = "binomial",
                     direction ="backward",
                     trace = FALSE,
                     trControl = train.control,
                     na.action = na.exclude
                     )

# test accuracy & error
step_caret$results
##   parameter       ROC      Sens       Spec      ROCSD    SensSD     SpecSD
## 1      none 0.8444048 0.9865934 0.09166667 0.02832703 0.0112382 0.07296625
# summary
summary(step_glm <- step_caret$finalModel)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3101  -0.4771  -0.2043  -0.0186   3.2842  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.62550    3.24856  -3.271  0.00107 ** 
## slp          -0.18892    0.04016  -4.704 2.55e-06 ***
## twi_sc       -0.41232    0.08595  -4.797 1.61e-06 ***
## ch_log        0.20915    0.09008   2.322  0.02024 *  
## precipsum     0.11292    0.03882   2.909  0.00363 ** 
## mast          0.14642    0.08798   1.664  0.09607 .  
## tc_2          0.06562    0.01534   4.277 1.89e-05 ***
## th            0.06553    0.03725   1.759  0.07851 .  
## u            -0.55631    0.33661  -1.653  0.09840 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 772.49  on 1163  degrees of freedom
## Residual deviance: 566.96  on 1155  degrees of freedom
## AIC: 584.96
## 
## Number of Fisher Scoring iterations: 8

6.3 Final model & accuracy

# examine possible thresholds
train$predict <- predict(step_glm, train, type = "response")

ggplot(train, aes(x = predict, fill = argillic)) +
  geom_density(alpha = 0.5) +
  geom_vline(aes(xintercept = 0.5), lty = "dashed") +
  xlab("probability") +
  scale_x_continuous(breaks = seq(0, 1, 0.2))

train$predict <- ifelse(train$predict > 0.3, "yes", "no") # classify using a 0.3 probability threshold

# Confusion Matrix
cm <- table(observed = train$argillic, predicted = train$predict)
confusionMatrix(cm, positive = "yes")
## Confusion Matrix and Statistics
## 
##         predicted
## observed  no yes
##      no  974  70
##      yes  72  48
##                                           
##                Accuracy : 0.878           
##                  95% CI : (0.8578, 0.8963)
##     No Information Rate : 0.8986          
##     P-Value [Acc > NIR] : 0.9900          
##                                           
##                   Kappa : 0.3354          
##                                           
##  Mcnemar's Test P-Value : 0.9331          
##                                           
##             Sensitivity : 0.40678         
##             Specificity : 0.93117         
##          Pos Pred Value : 0.40000         
##          Neg Pred Value : 0.93295         
##              Prevalence : 0.10137         
##          Detection Rate : 0.04124         
##    Detection Prevalence : 0.10309         
##       Balanced Accuracy : 0.66897         
##                                           
##        'Positive' Class : yes             
## 
# Deviance squared
library(modEvA)

# Deviance squared
Dsquared(step_glm)
## [1] 0.2660655
# Adjusted deviance squared
Dsquared(step_glm, adjust = TRUE)
## [1] 0.260982
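
The deviance squared can also be checked by hand against the deviances printed in the model summary, since D² = (null deviance − residual deviance) / null deviance:

```r
# hand check using the null and residual deviance from summary(step_glm)
null_dev  <- 772.49
resid_dev <- 566.96

(null_dev - resid_dev) / null_dev # ~ 0.266, matching Dsquared(step_glm)
```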
# Spatially variable accuracy
library(dplyr)

temp <- train %>%
  group_by(cluster) %>%
  summarize(
    TP       = sum(predict == "yes" & argillic == "yes", na.rm = TRUE),
    FN       = sum(predict == "no"  & argillic == "yes", na.rm = TRUE),
    sensitivity = TP / (TP + FN)
    )

ggplot(temp, aes(x = cluster, y = sensitivity)) +
  geom_point()

table(train$argillic, train$cluster)
##      
##         2   3   4   5   6   7   8  10  11  12  13  14  15
##   no  162  78   3 103  80  86  30  61 136  84  40  78 106
##   yes  27   8   0  30   2   6   3   0   0   1   2  19  23
  • Discuss the variability of the predictions across the clusters. Perhaps different models need to be constructed in each cluster.
  • Some clusters appear to be dominated by specific soil series. These data aren’t clean enough (nor are the series concepts usually) to model the series separately; however, we could use the clusters as an additional model to attempt to separate the series.
  • Do the hyperthermic clusters perform differently?

6.4 Model effects

# summary
summary(step_glm)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3101  -0.4771  -0.2043  -0.0186   3.2842  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.62550    3.24856  -3.271  0.00107 ** 
## slp          -0.18892    0.04016  -4.704 2.55e-06 ***
## twi_sc       -0.41232    0.08595  -4.797 1.61e-06 ***
## ch_log        0.20915    0.09008   2.322  0.02024 *  
## precipsum     0.11292    0.03882   2.909  0.00363 ** 
## mast          0.14642    0.08798   1.664  0.09607 .  
## tc_2          0.06562    0.01534   4.277 1.89e-05 ***
## th            0.06553    0.03725   1.759  0.07851 .  
## u            -0.55631    0.33661  -1.653  0.09840 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 772.49  on 1163  degrees of freedom
## Residual deviance: 566.96  on 1155  degrees of freedom
## AIC: 584.96
## 
## Number of Fisher Scoring iterations: 8
# Convert the coefficients to an odds scale (who here gambles?)
exp(coef(step_glm))
##  (Intercept)          slp       twi_sc       ch_log    precipsum         mast 
## 2.428859e-05 8.278507e-01 6.621149e-01 1.232636e+00 1.119541e+00 1.157677e+00 
##         tc_2           th            u 
## 1.067823e+00 1.067723e+00 5.733234e-01
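
The odds scale relates back to probabilities in a simple way; a minimal sketch of the conversion, using a hypothetical coefficient value and the plogis() inverse logit from base R:

```r
# a hypothetical coefficient on the log-odds scale
log_odds <- 0.7

odds <- exp(log_odds)          # multiplicative change in odds per unit increase
p    <- odds / (1 + odds)      # probability implied by these log-odds

all.equal(p, plogis(log_odds)) # TRUE: plogis() is the inverse logit
```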
# analysis of deviance
anova(step_glm)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: .outcome
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev
## NULL                       1163     772.49
## slp        1  105.514      1162     666.98
## twi_sc     1   39.675      1161     627.30
## ch_log     1    0.002      1160     627.30
## precipsum  1   38.383      1159     588.92
## mast       1    0.200      1158     588.72
## tc_2       1   18.390      1157     570.33
## th         1    0.619      1156     569.71
## u          1    2.752      1155     566.96
# visreg
library(visreg)

par(mfrow = c(2, 2))
visreg(step_glm, scale = "response")

  • View the results in ArcGIS and examine the accuracy at individual points
  • Discuss the effects of data quality, including both NASIS and GIS
  • Discuss how the modeling process isn’t an end in itself, but serves to uncover trends, possibly generate additional questions and direct future investigations

7 Generate spatial predictions

# Custom function to return the predictions and their standard errors

library(raster)

predfun <- function(model, data) {
  v <- predict(model, data, type = "response", se.fit = TRUE)
  cbind(
    p = as.vector(v$fit),
    se = as.vector(v$se.fit)
    )
  }
  
# Generate spatial predictions
r <- predict(geodata_r, step_glm, fun = predfun, index = 1:2, progress = "text")

# Export the results
writeRaster(r[[1]], "argi.tif", overwrite = TRUE, progress = "text")

writeRaster(r[[2]], "argi_se.tif", overwrite = TRUE, progress = "text")
library(raster)

# argillic probability
plot(raster("C:/workspace2/argi.tif"))
plot(ca794[1], col = NA, add = TRUE)

# argillic standard error
plot(raster("C:/workspace2/argi_se.tif"))
plot(ca794[1], col = NA, add = TRUE)

# Download clipped example from Pinto Basin Joshua Tree
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/logistic/argi_pb.zip"
download.file(githubURL, destfile = "C:/workspace2/argi_pb.zip")
unzip(zipfile = "C:/workspace2/argi_pb.zip", exdir = "C:/workspace2")

7.0.1 Exercise 2: View the prediction in ArcGIS

  • Examine the raster predictions in ArcGIS and compare them to the shapefile that contains the original observations (hint: classify the shapefile symbology using the argillic column)
  • Discuss with your group, and report your observations or hypotheses

8 References

Beaudette, D.E., and A.T. O’Geen, 2009. Quantifying the aspect effect: an application of solar radiation modeling for soil survey. Soil Science Society of America Journal 73:1345-1352.

Evans, D.M., and A.E. Hartemink, 2014. Digital soil mapping of a red clay subsoil covered by loess. Geoderma 230:296-304.

Gessler, P.E., I.D. Moore, N.J. McKenzie, and P.J. Ryan, 1995. Soil-landscape modelling and spatial prediction of soil attributes. International Journal of Geographical Information Systems 9:421-432.

Gorsevski, P.V., P.E. Gessler, R.B. Foltz, and W.J. Elliot, 2006. Spatial prediction of landslide hazard using logistic regression and ROC analysis. Transactions in GIS 10:395-415.

Hosmer Jr., D.W., S. Lemeshow, and R.X. Sturdivant, 2013. Applied logistic regression (Vol. 398). John Wiley & Sons.

Kempen, B., D.J. Brus, G. Heuvelink, and J.J. Stoorvogel, 2009. Updating the 1:50,000 Dutch soil map using legacy soil data: A multinomial logistic regression approach. Geoderma 151:311-326.

Nauman, T.W., and J.A. Thompson, 2014. Semi-automated disaggregation of conventional soil maps using knowledge driven data mining and classification trees. Geoderma 213:385-399. http://www.sciencedirect.com/science/article/pii/S0016706113003066

Peterson, F.F., 1981. Landforms of the Basin and Range Province: defined for soil survey. Nevada Agricultural Experiment Station Technical Bulletin 28, University of Nevada - Reno, NV. 52 p. http://jornada.nmsu.edu/files/Peterson_LandformsBasinRangeProvince.pdf

Shi, X., L. Girod, R. Long, R. DeKett, J. Philippe, and T. Burke, 2012. A comparison of LiDAR-based DEMs and USGS-sourced DEMs in terrain analysis for knowledge-based digital soil mapping. Geoderma 170:217-226. http://www.sciencedirect.com/science/article/pii/S0016706111003387

9 Additional reading

Lane, P.W., 2002. Generalized linear models in soil science. European Journal of Soil Science 53, 241- 251. http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2389.2002.00440.x/abstract

James, G., D. Witten, T. Hastie, and R. Tibshirani, 2014. An Introduction to Statistical Learning: with Applications in R. Springer, New York. http://www-bcf.usc.edu/~gareth/ISL/

Hengl, T., 2009. A Practical Guide to Geostatistical Mapping, 2nd Ed. University of Amsterdam, www.lulu.com, 291 p. ISBN 978-90-9024981-0. http://spatial-analyst.net/book/system/files/Hengl_2009_GEOSTATe2c0w.pdf